
List of Flash News about Flash Attention

2026-02-03 21:49
FP8 Training on NVIDIA H100 Cuts Time to GPT-2 to 2.91 Hours and Drops Cost to About 20 Dollars, According to @karpathy

According to @karpathy, enabling FP8 training in the nanochat GPT-2 reproduction delivered a 4.3 percent improvement in time to GPT-2, cutting training to 2.91 hours on a single 8x H100 node. At spot pricing, he notes, such a run can cost about 20 dollars, versus about 73 dollars for the previous 3.04 hour run, and he frames the result as roughly a 600 times cost reduction versus OpenAI's original GPT-2 training.

According to @karpathy, FP8 on H100 offers a theoretical 2x FLOPs advantage, but practical gains are limited by the overhead of scaling and format conversion, training that is not fully compute bound, and the small GEMMs at GPT-2 scale, yielding about a 7.3 percent per-step speedup and roughly 5 percent net after adjusting the training horizon. He points to torchao's reported 25 percent FP8 speedup on Llama3 8B as evidence that larger models may benefit more, and expects further gains from selectively applying FP8 to layers and tightening numerics.

According to @karpathy, additional wins came from Flash Attention 3, the Muon optimizer, gated residual and skip connections, and value embeddings, and he has published a reproducible setup and a time-to-GPT-2 leaderboard on GitHub.
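As a quick sanity check, the 4.3 percent figure follows directly from the two quoted run times, and the quoted costs imply node-hour rates that can be backed out the same way. A minimal sketch in Python; the run times and costs come from the item above, while the derived per-hour rates are inferences, not numbers stated by @karpathy.

```python
# Back-of-the-envelope check of the figures quoted above. The run times
# and costs come from the news item; the per-hour rates are derived here
# and are inferences, not numbers stated by @karpathy.

prev_hours, new_hours = 3.04, 2.91
improvement = (prev_hours - new_hours) / prev_hours
print(f"time-to-GPT-2 improvement: {improvement:.1%}")  # ~4.3%, matching the quoted figure

prev_cost, spot_cost = 73.0, 20.0
print(f"implied 8x H100 node rate, previous run: ${prev_cost / prev_hours:.2f}/hr")  # ~$24/hr
print(f"implied 8x H100 node rate, spot run:     ${spot_cost / new_hours:.2f}/hr")   # ~$6.87/hr
```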
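Why a 2x theoretical FLOP advantage nets only about 7.3 percent per step can be seen with an Amdahl-style estimate: only the GEMM share of a training step accelerates, and scaling and conversion overhead keep the realized GEMM speedup well below 2x. A sketch under assumed values; the GEMM time fraction and effective GEMM speedup below are illustrative, not figures from @karpathy.

```python
# Amdahl-style sketch of why a 2x theoretical FP8 FLOP advantage can net
# only ~7% per step: just the GEMM share of step time speeds up, and
# scaling/conversion overhead keeps the realized GEMM speedup below 2x.
# The 0.25 fraction and 1.4x figure are illustrative assumptions.

def step_speedup(gemm_fraction: float, gemm_speedup: float) -> float:
    """Whole-step speedup when only the GEMM fraction of time accelerates."""
    return 1.0 / ((1.0 - gemm_fraction) + gemm_fraction / gemm_speedup)

# Small GEMMs at GPT-2 scale: suppose ~25% of step time is FP8-eligible
# matmuls and FP8 delivers ~1.4x on them after conversion overhead.
print(f"{step_speedup(0.25, 1.4) - 1.0:.1%} faster per step")  # ~7.7%, near the quoted 7.3%
```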
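For readers who want to try FP8 themselves, torchao, the library cited for the Llama3 8B result, exposes a one-call conversion for float8 training, and its module filter is one way to realize the selective per-layer application @karpathy mentions. A minimal sketch; the toy model and filter rule are illustrative assumptions, not nanochat's actual configuration.

```python
# Sketch of enabling FP8 training with torchao's float8 API. Requires an
# FP8-capable GPU (e.g., H100) and torchao installed; the model and filter
# below are toy examples, not nanochat's setup.
import torch
import torch.nn as nn
from torchao.float8 import convert_to_float8_training

model = nn.Sequential(
    nn.Linear(1024, 4096),
    nn.GELU(),
    nn.Linear(4096, 1024),
).to("cuda").to(torch.bfloat16)

def module_filter_fn(mod: nn.Module, fqn: str) -> bool:
    # "Selectively applying FP8 to layers": skip GEMMs whose dimensions
    # are not divisible by 16 (an FP8 kernel requirement), and which tend
    # to see little gain anyway at small sizes.
    if isinstance(mod, nn.Linear):
        return mod.in_features % 16 == 0 and mod.out_features % 16 == 0
    return True

# Swaps eligible nn.Linear modules for float8 training variants in place;
# the training loop itself is unchanged.
convert_to_float8_training(model, module_filter_fn=module_filter_fn)
```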
